ABSTRACT

A study examined segmental and suprasegmental
elements which contribute to an impression of one speaking style as
opposed to another. A corpus containing three styles of speech,
casual, careful, and read, for the same linguistic content was
gathered. Thirteen speakers from Paris, France (aged 24-35) were
given a scenario to be acted out over the telephone, designed to
provoke changes in style during the course of the dialog. Examination
of global and individual results revealed that: (1) spontaneous
styles of speech cannot be considered to be linear modifications of
read speech (careful speech is not necessarily faster than read
speech, but slower than casual speech, for example), but probably
closer to separate types of modifications of casual speech; and (2)
the same perceived style is achieved by different speakers in
different ways.

ABSTRACT



This paper examines segmental and suprasegmental elements 
which contribute to an impression of one speaking style as 
opposed to another. A corpus containing three styles of speech, 
casual, careful and read, for the same linguistic content was 
gathered. Examination of global and individual results reveals 
that: 1) spontaneous styles of speech cannot be considered to
be linear modifications of read speech (careful speech is not 
necessarily faster than read speech, but slower than casual 
speech, for example), but probably closer to separate types of 
modifications of casual speech; and 2) the same perceived style 
is achieved by different speakers in different ways. 

1. INTRODUCTION

One of the many sources of the apparent variability of the
speech signal is a speaker's conscious or unconscious change in
the style in which he expresses himself [1]. Herein, we define
style to be the expression of information about the dialect and 
socioeconomic background of the speaker, information about 
the manner in which he is expressing himself (formal, casual, 
reading, etc.), and information on the image he has of the 
speaker(s) he is addressing (slowing down for the hard of 
hearing, or foreigners, etc.). Style may overlap, but does not 
encompass, the range of a speaker's emotions and attitudes. 

In this article, we examine the differences in suprasegmental 
and segmental elements of speech, such as intensity or 
devoicing, for different styles of speech.

As with spontaneous speech, where many styles of speech may be
enumerated, read speech (reading a play aloud, a book to a
child, or a book to a blind person [2]) also involves differing
styles. A comparison of one type of read speech with one style
of spontaneous speech, except in the case of a specific
application ([3], for telephone applications), may not give the
necessary perspective on the significant elements which
contribute to the perception of a given style. We therefore
propose to compare two styles of spontaneous speech, casual
conversation and careful repetition, to one style of read speech,
reading a dialogue, as in reading a play aloud.

The underlying motivation for this study is to try to better
understand the variability of speech by looking for the extremes
of the variability in given dimensions. A more pragmatic
purpose is to eventually ameliorate the quality of synthetic
speech with better knowledge of what constitutes careful
speech. Results from this type of study could aid in making
synthesis more understandable, on the one hand, and more
convincingly natural (because it could take individual
characteristics into account) on the other hand.

We used a corpus of speech where, in a fairly spontaneous
situation, the same linguistic content was obtained both in
casual and careful style, and then the transcription of the casual
dialogue was reread by the speaker.

The following yardsticks were used to characterise each style:
overall intensity, F0 maximum, dynamic range of F0, number of
pauses, speaking rate, amount of phonological changes, F1/F2
shift, and the amount of stop bursts. Global values over all
speakers have been compared, as have the values for individual
speakers, in view of examining their separate and differing
strategies [4].



2. A CORPUS OF THREE SPEECH STYLES: SPOT

A corpus of spontaneous speech was designed, recorded, 
labelled, and verified [5]. 

2.1 Corpus design 

The speech was elicited using a "Wizard-of-Oz" scenario
technique designed to provoke changes in style during the
course of the dialogue. Thirteen Parisian speakers (10 m, 3 f)
ranging in age from 24 to 35 years old were given a scenario to
be acted out over the telephone with the "wizard". All of the
speakers, except GA, were given two different scenarios to act
out. They were to play the role of a second-year student from a
computer science school who wants to do a four-month
research project at LIMSI. The names of the research subjects,
the person's address, and items in his background all included
phonetic contexts where phonological variants might be
expected (for example, "processus schématisés", where
palatalisation could occur at the word boundary). The wizard was
to provoke a change in the speaker's style, from casual to
careful speech, while maintaining the same phonetic context for
the new style. This was done by asking "Comment?" ("What?")
twice during the conversation (if it did not become too
evident), just after the speaker had used a sentence having one
of the desired phonetic contexts. It was expected that the
speaker would repeat basically the same message, rewording it,
while keeping the same information-carrying words each time.
From these five-minute dialogues, the utterances just before
and after the "What?" were excised. All measures were taken
only on the parts of the utterances before and after the "What?"
which had approximately the same phonetic content, roughly
the rheme of the sentence. Therefore, out of 25 five-minute
dialogues, only 44 pairs of sentences were left. Obtaining the
same linguistic material in two different styles of spontaneous
speech is not easy; our paradigm, in hindsight, although
amusing to design and record, is not efficient if we compare the
useful amount of speech to total dialogue time.

2.2 Corpus recording

The speakers were asked to sit in an office and use a telephone 
which had been fitted with a microphone having a wider
passband (100-5000 Hz) than that of the
telephone microphone. Two audio recordings were made: one
of the speaker only, and another, of both sides of the 
conversation, through a tap on the wizard's telephone. The 
conversations were also recorded on videotape for later work 
on gestures. The speaker's signal was digitized at 10 kHz and 
stored on a PC-compatible. This is the signal used for the 
measurements below. 

2.3 Labelling and verifying the corpus

The speech was phonemically labelled and orthographically
transcribed. Spectrograms and an F0 analysis were obtained
using the UNICE software (LIMSI-VECSYS).

In order to verify that the speech was also perceived as changing
in style, and that the second style was judged to be more careful
than the first, a jury of four listened to the pairs of utterances.

The jury consisted of the speaker, the wizard, the author, and a
person not otherwise involved in this database. After listening
to a pair, they were asked whether one utterance of the pair
reflected an effort to "make oneself better understood" (literal
translation of the question as it was asked in French). The
majority decision determined which pairs were kept. Only 24 of
the 44 original pairs (ten different speakers, 8 m, 2 f) were
retained. In view of the results per speaker below, we could still
question the inclusion of speaker RS's utterances, since the
elements characterising his careful speech are typical of
Lombard speech: he may have interpreted the request to
repeat as meaning that there was noise on the line.
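
The jury decision can be sketched as a simple vote count. This is only an illustration: the function name and the vote encoding are hypothetical, the example votes are invented, and the handling of a 2-2 split (here, the pair is rejected) is an assumption, since the text states only that the majority decision determined which pairs were kept.

```python
def keep_pair(votes):
    """Return True if a strict majority of jurors judged that one
    utterance of the pair reflected an effort to be better understood.

    votes: list of booleans, one per juror (four jurors in the study).
    A tie counts as a rejection; this tie rule is an assumption.
    """
    return sum(votes) > len(votes) / 2


# Hypothetical votes for three pairs (not data from the study).
example_votes = {
    "pair_a": [True, True, True, False],
    "pair_b": [True, True, False, False],  # tie: rejected under the assumed rule
    "pair_c": [False, False, True, False],
}
kept = [name for name, votes in example_votes.items() if keep_pair(votes)]

# Retention rate reported in the text: 24 of the 44 original pairs.
retention_percent = 100.0 * 24 / 44
```

With the counts given in the text, a little over half of the excised pairs survive the jury check.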

The speakers were also asked to read the orthographic
transcription of their conversations. False starts and hesitations
were removed to make reading more fluent. The text was
presented in play form. The read style was set by the person 
who read the wizard's lines as if rehearsing a play. The speech 
with the same linguistic content as in the other two styles was 
excised and labelled. 

As far as the total amount of data is concerned, it should be
noted that the quantity of data varies from one speaker to 
another, due to the jury decision. Also, one of the male 
speakers was no longer available at the time the read speech 
was recorded. 

3. SUPRASEGMENTAL AND SEGMENTAL PARAMETERS
STUDIED

The elements used to characterise the speech were: overall
intensity, F0 maximum, dynamic range of F0, number of
pauses, speaking rate, amount of phonological changes, F1/F2
shift, and the amount of stop bursts. In an earlier comparison of
the two spontaneous styles [5], the amount of empty words, such
as "euh", and the number of incidences of stuttering (as an
indication of an effort to better articulate) were also examined.
They were not used here due to the fact that they would 
probably reflect only reading skill in the third style. 

Each measure was taken individually per utterance, then
grouped over all utterances of each speaker, and finally totalled
over all speakers. It should be noted that, due to the nature of
our paradigm, the material on which our measures are based is
the high information content part of the sentence, roughly
corresponding to the rheme of the sentence. For example, the
sentence, "And you announced a project on connectionist
models." before the "What?", would give us "a project on
connectionist models" after the "What?". Interesting studies
have pursued the difference between high- and low-information
parts of sentences, by labelling the individual words as being of
high and low information content [6]. It would be interesting to
measure our data in this manner also, to see if style is
expressed only in the individual high content words, or over all
of the rheme of the sentence.
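
The three levels of measurement (per utterance, per speaker, over all speakers) can be sketched as follows. The function name, the tuple layout, and the example values are hypothetical, and pooling all utterances for the global value is an assumption about how the totals were formed.

```python
from collections import defaultdict
from statistics import mean


def aggregate(measurements):
    """Aggregate per-utterance values at speaker level and globally.

    measurements: iterable of (speaker, style, value), one per utterance.
    Returns (per_speaker, overall):
      per_speaker[(speaker, style)] -> mean over that speaker's utterances
      overall[style]                -> mean over all utterances pooled
    """
    by_speaker = defaultdict(list)
    by_style = defaultdict(list)
    for speaker, style, value in measurements:
        by_speaker[(speaker, style)].append(value)
        by_style[style].append(value)
    per_speaker = {key: mean(vals) for key, vals in by_speaker.items()}
    overall = {key: mean(vals) for key, vals in by_style.items()}
    return per_speaker, overall


# Hypothetical speaking-rate values in phonemes/s (not data from the study).
rows = [
    ("S1", "casual", 14.0), ("S1", "casual", 12.0), ("S1", "careful", 10.0),
    ("S2", "casual", 13.0), ("S2", "careful", 12.0),
]
per_speaker, overall = aggregate(rows)
```

Keeping the speaker-level means alongside the pooled values is what allows the individual strategies to be compared with the global picture.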

3.1 overall intensity

The mean intensity (in dB) of each utterance was calculated. In 
order to also have an idea of the perceived intensity, we asked 
seven subjects who had not taken part in the experiment so far 
to listen to the 24 pairs of utterances and to indicate whether 
the careful utterance was "louder" than the casual one. This
result is expressed as the percentage of the jury who said that 
the careful version was louder. 
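
As a rough sketch of the two intensity measures: the text does not say how the mean intensity in dB was computed, so the RMS-based definition below, the reference value, and both function names are assumptions. The jury score is simply the share of listeners who judged the careful utterance louder.

```python
import math


def mean_intensity_db(samples, reference=1.0):
    """RMS level of an utterance in dB relative to `reference`.

    This RMS-based definition is an assumption, not the paper's stated method.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / reference)


def percent_judged_louder(judgements):
    """Percentage of listeners who judged the careful utterance louder.

    judgements: list of booleans, one per listener (seven in the study).
    """
    return 100.0 * sum(judgements) / len(judgements)


# Hypothetical example: five of seven listeners say "louder".
score = percent_judged_louder([True] * 5 + [False] * 2)
```

Comparing the measured difference in dB with this percentage, pair by pair, is what reveals cases where speech is perceived as louder without being more intense.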

3.2 F0 maximum and dynamic range of F0

Two measures using F0 were obtained. First, the mean F0
value of F0 maxima on stable vowels (determined by hand) was
calculated. Then, the dynamic range of F0 for a given utterance
was determined by subtracting the minimum value of F0 for the
utterance from the maximum value (again on stable
vowels). In order to normalise values over all speakers, it is
expressed as a percentage of the F0 maximum value:

(F0max - F0min) / F0max
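
A minimal sketch of this normalised range, assuming the F0 values measured on the stable vowels of one utterance are already available in Hz (the function name and the example values are hypothetical):

```python
def f0_dynamic_range_percent(f0_values_hz):
    """Dynamic range of F0 as a percentage of the F0 maximum.

    f0_values_hz: F0 values (Hz) measured on the stable vowels of one utterance.
    """
    f0_max = max(f0_values_hz)
    f0_min = min(f0_values_hz)
    return 100.0 * (f0_max - f0_min) / f0_max


# Hypothetical utterance: F0 between 120 Hz and 200 Hz.
example_range = f0_dynamic_range_percent([120.0, 150.0, 200.0])
```

Dividing by the maximum makes the ranges of speakers with very different pitch levels comparable on the same scale.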

3.3 number of pauses

The number of pauses, irrespective of their durations, was 
measured for each utterance. It was observed that, in careful
speech, pauses often appeared just before and just after the
information-containing words of the utterance. It is possible
that, as for empty words and stuttering, the observations for
read speech reflect reading skill, for some speakers. 

3.4 speaking rate

The literature on the subject of speech called careful, formal,
or clear typically indicates that speakers slow down when they
are trying to be better understood. Speaking rate is expressed
as the mean number of phonemes per second. 
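
The rate measure can be sketched as below. The function names and example values are hypothetical, and whether pause time is included in the utterance duration is not stated in the text, so the duration is simply taken as given.

```python
from statistics import mean


def speaking_rate(n_phonemes, duration_s):
    """Speaking rate of one utterance in phonemes per second."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return n_phonemes / duration_s


def mean_speaking_rate(utterances):
    """Mean rate over utterances given as (n_phonemes, duration_s) pairs."""
    return mean(speaking_rate(n, d) for n, d in utterances)


# Hypothetical utterances (not data from the study).
rate = mean_speaking_rate([(30, 2.5), (28, 2.0)])
```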

3.5 phonological changes

Using the label character strings, phonological variations 
representing voicing, devoicing, schwa deletion, palatalisation,
and nasalisation were totalled and expressed as a percentage of 
all possible contexts where variants could be present. 
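
The percentage can be sketched as follows, assuming the possible contexts have already been extracted from the label strings (the label format itself is not described, so it is not parsed here). The function name and the example contexts are hypothetical.

```python
from collections import Counter


def variant_percentages(contexts):
    """Percentage of possible contexts in which a variant was realised.

    contexts: list of (variant_type, realised) pairs, one per context
    where a phonological variant could have occurred.
    Returns (overall_percent, percent_by_type).
    """
    possible = Counter(kind for kind, _ in contexts)
    realised = Counter(kind for kind, done in contexts if done)
    overall = 100.0 * sum(realised.values()) / len(contexts)
    by_type = {kind: 100.0 * realised[kind] / n for kind, n in possible.items()}
    return overall, by_type


# Hypothetical contexts (not data from the study).
contexts = [
    ("schwa deletion", True), ("schwa deletion", True),
    ("schwa deletion", False), ("schwa deletion", False),
    ("devoicing", True), ("devoicing", False),
    ("palatalisation", False), ("palatalisation", False),
]
overall, by_type = variant_percentages(contexts)
```

The same ratio of realised to possible cases applies to any measure reported as a percentage of opportunities.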

3.6 F1/F2 shift

The F1 and F2 values of all stable /a/, /i/, and /u/s were
measured and the mean value taken for each speaker and for
all the speakers together for each speech style. 

3.7 presence of stop releases

In French, stop releases are either not present or very low in 
amplitude, often depending on their place in a sentence. As 
another means of exploring the effort to better articulate, the 
number of stop releases present was expressed as a percentage 
of the total number of stops pronounced (nasals were not
counted here). 

4. RESULTS 

4.1 global results 

The table below gives the global results.

[Only fragments of the table survive. The rows and values that
remain are listed below; the style (casual, careful, or read) to
which each value belongs cannot be recovered from the layout.]

Measure                   Surviving values
% of all stop bursts      74.6 ± 15, 86.0 ± 6
F1 mean value /u/ (Hz)    416, 375
F1 mean value /a/ (Hz)    (not preserved)
F1 mean value /i/ (Hz)    309, 267
F2 mean value /u/ (Hz)    1025, 885, 1010
F2 mean value /a/ (Hz)    1593, 1593
F2 mean value /i/ (Hz)    2173, 2192, 2180

Two observations may be made. First, some studies of style
change seem to indicate that one style may be modelled as a
degree more (or less) of various segmental and suprasegmental
elements than another. Careful style, for example, would simply
be slower than casual speech, but faster than read speech. Our
data do not agree with this. Read speech, for example, is much
more "expressive", as measured by the dynamic range of F0,
than careful speech is. This would argue for another manner of
studying speech style. Instead of viewing read speech as the base
upon which modifications are made toward other styles of
speech, it would be more logical (although more difficult) to
use casual speech as the base, with each speech style being a
unique type of modification. There is a parallel here with
language acquisition: a child speaks in a casual style until age
five or six, when he learns other styles, such as reading or
speaking in a respectful way to teachers. Each of these styles is
learned separately and may be viewed as a modification of his
casual speech. The data in the following sections will be
evaluated in this manner.

The second observation concerns the very large standard 
deviations. Observing only the totals of the data over all 
speakers implies that we presume that all speakers express style
differences in the same manner. The standard deviation values
here point to the fact that speakers are in fact using different 
strategies to achieve the same perceived result. 
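
The effect of pooling speakers who use opposite strategies can be illustrated with invented numbers. The values below are hypothetical and are not data from the study: each one stands for a single speaker's change in some measure from casual to careful speech.

```python
from statistics import mean, stdev

# Hypothetical per-speaker changes (careful minus casual), not study data.
# Two speakers raise the measure, two lower it by a similar amount.
changes = {"S1": 4.0, "S2": -4.0, "S3": 3.0, "S4": -3.0}

values = list(changes.values())
pooled_mean = mean(values)   # the opposite strategies cancel out
pooled_sd = stdev(values)    # but the spread across speakers stays large

raised = sorted(s for s, c in changes.items() if c > 0)
lowered = sorted(s for s, c in changes.items() if c < 0)
```

In such a case the pooled mean suggests no style effect at all, while the large standard deviation is the trace of two distinct strategies.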

4.2 Individual results

In order to examine the differences in the strategies of the 
individual speakers, let us look at the results for speakers FR
and GA. 

First let us look at these speakers' behaviours in the use of
intensity as a means of expressing style change. The literature on
the subject of formal (careful) speech typically indicates that
speakers increase intensity when they are trying to be better
understood. The graph below plots the actual intensity
measurements compared to the percentage of the jury that
perceived the careful speech to be louder than casual speech.

FR and GA, although perceived to be speaking louder by over
50% of the jury, did not actually produce louder speech. Other
elements need to be investigated to determine what gave the
jury the impression that they were speaking louder and more
carefully. If we look at our measures of F0, we have the results
at the right.

FR's careful speech has a higher F0, and has a slightly larger
pitch range. GA also increases F0 in careful, as opposed to
casual speech, but decreases the pitch range. The increase in
F0 may account for the impression of louder speech. Further
examination, into the segmental elements which may reflect an
effort to articulate more precisely, reveals the following results
for phonological variants and stop releases:

We see a clear effort to articulate better on the part of GA, for
careful speech as opposed to casual speech, which is not the
case for FR. If we look at the results for read speech, we see
that FR and GA have a decreased amount of phonological
variants as compared to casual speech, but do not make a
particular effort concerning stop releases. GA seems to make
an overall effort to articulate better for careful speech, whereas
more precise articulation is characteristic of read speech for
FR. The results for speaking rate and change in F1/F2 show no
statistically significant variation from one style to another for
these two speakers. We include the results for speaking rate
here to show the difference between the results for FR and GA
and those of AV.

[Figure: SPEAKING RATE for casual, careful, and read speech]


Due to lack of space, all of the results for all speakers cannot
be shown. Some other significant strategies:

• For AV: She increased maximum F0 and F0 dynamic range and
spoke more slowly for careful speech, as opposed to casual
speech. For read speech, she decreased intensity and the
amount of phonological variants.

• For PHE: Read speech was characterised by an increase in
F0 dynamic range, and a decrease in intensity and the number of
phonological variants.

5. INTERPRETATION OF THE RESULTS 

The results show that individual speakers do express speech 
styles in different ways. Although certain elements found in the
literature, such as slowing down and speaking louder for a
careful style, are used by some speakers, it is not the case for
everyone. Other elements, such as an increase in the dynamic
range of F0, are used by other speakers who are also perceived
to be making an effort to be better understood. One of the 
reasons that careful speech is not always characterised by an 
increase in intensity may be (in the case of AV for example, 
who is an assistant professor) that teachers tend to lower their 
voices rather than raise them to get students to listen more 
closely. 

Some of the results for read speech may be artifacts related to 
reading skills. An assessment of the reading skills of each 
speaker should help clarify this point.

Certain style changes are marked by statistically significant
data, but this is not the case for all speakers and all styles. It is
probable that the expression of, for example, careful speech,
was not as strong for all speakers. The familiarity of certain
speakers with the wizard and the desire to play the role as well
as possible may be the causes here. It is certainly possible to
imagine different degrees of a given style, such as careful speech
with a complete stranger, careful speech with a good friend,
careful speech with a small child. In this case it might be
possible to place the change on one same axis, the speech
slowing down when speaking to a small child rather than a
good friend.

A model of a real voice rather than a composite one for high 
quality synthetic speech must take these individual strategies
into account if it is to be convincing. Speech recognition may
also benefit from better understanding of the strategies used
here, being able to predict a number of variations for a given 
speaker from a small sample of his speech. 



6. CONCLUSIONS

We have gathered and analysed casual, careful, and read
speech from several speakers. Our results on these data show
that each style is characterised by different elements.
Classically, studies have considered read speech as a
conservative starting point on a linear axis where casual speech
would be at the other end. Casual speech seems to us to be a
better base from which to find those elements than read speech
does.

We have also shown that speakers use different strategies to 
achieve the same perceived result.

As mentioned above, this type of data is difficult to obtain. We 
are presently recording another database for style comparison 
that includes a large number of speakers. The paradigm used
furnishes more usable speech per recording, and will be used to 
confirm our findings. 

We also intend to soon test our findings by manipulating 
synthetic speech and evaluating the perception of the results.

Many more studies, involving other style changes, relations 
between elements of different nature (phonological and 
prosodic, for example), and languages other than French also 
need to be carried out in order to aid our comprehension of the 
limits of individual variability. 